Part 1 ‐ Exploratory data analysis

The attached logins.json file contains (simulated) timestamps of user logins in a particular geographic location. Aggregate these login counts based on 15-minute time intervals, and visualize and describe the resulting time series of login counts in ways that best characterize the underlying patterns of the demand. Please report/illustrate important features of the demand, such as daily cycles. If there are data quality issues, please report them.

At 15-minute resolution the data is too dense to visually identify any patterns, so it needs to be resampled. The figure below resamples the data daily and plots the total logins per day.
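A minimal sketch of this aggregation, using a few hypothetical timestamps in place of the full logins.json:

```python
import pandas as pd

# Hypothetical timestamps standing in for the contents of logins.json
timestamps = pd.to_datetime([
    "1970-01-01 20:13:18", "1970-01-01 20:16:10",
    "1970-01-01 20:31:02", "1970-01-02 09:05:44",
])
logins = pd.DataFrame({"login_time": timestamps}).set_index("login_time")

# Count logins in 15-minute bins, then resample to daily totals
counts_15min = logins.resample("15min").size()
counts_daily = counts_15min.resample("D").sum()
print(counts_daily)
```

On the full data set, plotting the daily series (e.g. `counts_daily.plot()`) produces the figure below.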

image.png

The daily plot shows that activity tends to peak at the end of each week, drop sharply at the start of the following week, and then build up again. The one exception to this pattern is the third week of March 1970, where most of the daily counts remain relatively high.

The figure below represents the second week of January 1970 with an hourly resampling of the data.

image.png

Hourly aggregation shows that demand peaks daily around noon and midnight, and that most of the demand occurs on weekends.

Part 2 ‐ Experiment and metrics design

The neighboring cities of Gotham and Metropolis have complementary circadian rhythms: on weekdays, Ultimate Gotham is most active at night, and Ultimate Metropolis is most active during the day. On weekends, there is reasonable activity in both cities.

However, a toll bridge, with a two way toll, between the two cities causes driver partners to tend to be exclusive to each city. The Ultimate managers of city operations for the two cities have proposed an experiment to encourage driver partners to be available in both cities, by reimbursing all toll costs.

1. What would you choose as the key measure of success of this experiment in encouraging driver partners to serve both cities, and why would you choose this metric? For this experiment I would choose as the key measure of success the average number of times the toll bridge is used per driver. An increase in that number would mean that reimbursing the toll cost is an effective initiative to encourage drivers to work in both cities. However, we need to take into account the complementary circadian rhythms of the two cities, so when calculating the averages I will compute one average for weekdays and another for weekends.
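As a sketch of how this metric could be computed from trip logs (the `crossings` frame and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical log: one row per recorded toll-bridge crossing
crossings = pd.DataFrame({
    "driver_id": [1, 1, 2, 2, 2, 3],
    "date": pd.to_datetime([
        "2014-01-06", "2014-01-11", "2014-01-07",
        "2014-01-07", "2014-01-12", "2014-01-08",
    ]),
})
crossings["is_weekend"] = crossings["date"].dt.dayofweek >= 5

# Crossings per driver per day, then averaged within each segment
per_day = crossings.groupby(["is_weekend", "driver_id", "date"]).size()
metric = per_day.groupby(level="is_weekend").mean()
print(metric)  # one weekday average and one weekend average
```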

2. Describe a practical experiment you would design to compare the effectiveness of the proposed change in relation to the key measure of success. Please provide details on:
a. how you will implement the experiment I will randomly divide the driver population into two groups, A and B. For group A the toll costs will be reimbursed (this is the change the company wants to introduce), and for group B nothing will be reimbursed. Then for each driver in each group I will record the number of times the toll bridge is used. As mentioned before, weekdays will be recorded separately from weekends.

b. what statistical test(s) you will conduct to verify the significance of the observation The experiment corresponds to an A/B test. For each population we are interested in the average number of times the toll bridge is used per day. If there is a difference between the averages of the two populations, we want to determine whether that difference is statistically significant. For this purpose I will use a two-sample z-test to evaluate the statistical significance of the difference in means between the two populations.

Given the difference in behavior between weekdays and weekends, I will perform the test for weekdays and weekends separately.
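A sketch of this z-test on simulated counts (the Poisson rates and sample sizes below are illustrative, not real data):

```python
import numpy as np
from scipy import stats

def two_sample_z_test(a, b):
    # Z-statistic for the difference in means of two large
    # independent samples, with a two-sided p-value.
    a, b = np.asarray(a, float), np.asarray(b, float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    p = 2 * stats.norm.sf(abs(z))
    return z, p

# Simulated daily bridge crossings per driver for each group
rng = np.random.default_rng(0)
group_a = rng.poisson(2.5, size=500)  # toll reimbursed
group_b = rng.poisson(2.0, size=500)  # control
z, p = two_sample_z_test(group_a, group_b)
print("z = %.2f, p = %.4f" % (z, p))
```

The same test would be run once on weekday data and once on weekend data.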

c. how you would interpret the results and provide recommendations to the city operations team along with any caveats. If a statistically significant increase in toll bridge usage is observed, I will recommend that the city operations team proceed with the change. However, as mentioned before, I will first determine whether there is a difference between weekdays and weekends. If there is, that information will be provided too, so that toll costs are reimbursed only on the days when the bridge is actually used; for example, it may make sense to reimburse the costs during weekends but not during weekdays. That said, when analyzing the data it may also be useful to conduct the statistical test by hour of day, to determine whether toll bridge usage is time dependent, particularly on weekdays.

Part 3 ‐ Predictive modeling

Ultimate is interested in predicting rider retention. To help explore this question, we have provided a sample dataset of a cohort of users who signed up for an Ultimate account in January 2014. The data was pulled several months later; we consider a user retained if they were “active” (i.e. took a trip) in the preceding 30 days.

We would like you to use this data set to help understand what factors are the best predictors for retention, and offer suggestions to operationalize those insights to help Ultimate.

The data is in the attached file ultimate_data_challenge.json. See below for a detailed description of the dataset. Please include any code you wrote for the analysis and delete the dataset when you have finished with the challenge.

Data description

city: city this user signed up in

phone: primary device for this user

signup_date: date of account registration; in the form ‘YYYY MM DD’

last_trip_date: the last time this user completed a trip; in the form ‘YYYY MM DD’

avg_dist: the average distance in miles per trip taken in the first 30 days after signup

avg_rating_by_driver: the rider’s average rating over all of their trips

avg_rating_of_driver: the rider’s average rating of their drivers over all of their trips

surge_pct: the percent of trips taken with surge multiplier > 1

avg_surge: The average surge multiplier over all of this user’s trips

trips_in_first_30_days: the number of trips this user took in the first 30 days after signing up

ultimate_black_user: TRUE if the user took an Ultimate Black in their first 30 days; FALSE otherwise

weekday_pct: the percent of the user’s trips occurring during a weekday

1. Perform any cleaning, exploratory analysis, and/or visualizations to use the provided data for this analysis (a few sentences/plots describing your approach will suffice). What fraction of the observed users were retained?

'avg_rating_by_driver', 'avg_rating_of_driver' and 'phone' have null values. However, the rows containing those null fields may still contain useful information. The average rating values tend to be very high, so the null values in 'avg_rating_by_driver' and 'avg_rating_of_driver' are replaced by their mean values, 4.78 and 4.608 respectively. For the 'phone' column, the null values are replaced by the string 'other'. The column 'ultimate_black_user' is converted from boolean to int, and 'last_trip_date' and 'signup_date' are converted to datetime type.

An inspection of the pairwise correlations between numerical columns does not show strong correlations between variables. The strongest correlation is between 'avg_surge' and 'surge_pct', with a value of 0.79. This indicates that practically all the columns can be treated as independent variables in the subsequent analysis. This is confirmed by the correlation matrix plots shown below.

image.png

The diagonal shows the distribution of values for each variable. Most of the variables have values concentrated at one end of their range. This may cause the models to underperform during training, so it may be better to rescale the variables that are both skewed and cover a long range of values. This decision will be made after the model to train is chosen.

To calculate the fraction of retained users, given that the exact date when the data was retrieved is not provided, the most recent date in the column 'last_trip_date' is taken as the reference for estimating user retention. Retained users are those who completed a trip in the 30 days preceding the reference date. Thus, relative to the most recent trip date in the data set, the fraction of retained users is 37.61%.

2. Build a predictive model to help Ultimate determine whether or not a user will be active in their 6th month on the system. Discuss why you chose your approach, what alternatives you considered, and any concerns you have. How valid is your model? Include any key indicators of model performance.

The problem at hand is binary classification. For this kind of problem, SVMs, logistic regression and decision trees are all good options. However, we want to interpret the results in addition to having reliable predictions. This rules out SVMs: despite their potentially high accuracy, their results are harder to interpret. Between logistic regression and decision trees, I decided on a tree-based model because its results are easier to interpret. Decision trees, however, overfit easily, so among tree-based models I chose a random forest classifier, which mitigates overfitting and reduces variance. Since we are using a tree-based model, it is not necessary to rescale the features; during training the model can make its splits properly with the data as it is.

In this problem we care about correctly identifying the users who are active in the 6th month after signup. A user who is active after 6 months belongs to the positive class (1), and an inactive user to the negative class (0). Given that the data set is imbalanced, with the positive class representing only about 37% of the data, precision is an appropriate metric to evaluate the model. Precision is the fraction of correctly identified positives among all targets predicted as positive.
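A minimal illustration of the metric on toy labels (not the model's actual predictions):

```python
# precision = true positives / (true positives + false positives)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # toy ground truth
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # toy predictions

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
precision = tp / (tp + fp)
print(precision)  # 2 of the 3 predicted positives are correct
```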

The best hyper-parameters for the random forest model were determined by performing several grid searches with cross-validation, in order to avoid overfitting the training data and obtain good performance on the test data. The following metrics were obtained on the test data:

Accuracy: 78.44%

Precision for the positive class: 74.81%

3. Briefly discuss how Ultimate might leverage the insights gained from the model to improve its long-term rider retention (again, a few sentences will suffice).

The most important factor in rider retention is 'avg_rating_by_driver'. So Ultimate needs to make sure that the user experience is up to expectations, perhaps even by providing drivers some training in customer interaction.

People from King's Landing are more likely to be retained than people from the other cities. Similarly, iPhone owners are more likely to be retained than Android users. iPhone users are associated with higher incomes, so people from King's Landing may also be wealthier than those from other cities. To increase retention among users with lower incomes, special discounts could be offered to Android users and to inhabitants of Winterfell and Astapor.

Given that the number of trips in the first 30 days plays an important role in user retention, special offers could be created to encourage service use during the first month after signup.

Finally, given that 'weekday_pct' is also an important factor in user retention, Ultimate could introduce incentives that encourage more trips per week, or even a system of points per mile traveled with the service.

Code

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json

Part 1 ‐ Exploratory data analysis

The attached logins.json file contains (simulated) timestamps of user logins in a particular geographic location. Aggregate these login counts based on 15-minute time intervals, and visualize and describe the resulting time series of login counts in ways that best characterize the underlying patterns of the demand. Please report/illustrate important features of the demand, such as daily cycles. If there are data quality issues, please report them.

Import data

In [2]:
# load json as string
# json.load((open('./logins.json')))
logins_df = pd.read_json('./logins.json')
In [3]:
logins_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 93142 entries, 0 to 93141
Data columns (total 1 columns):
login_time    93142 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 1.4 MB
In [4]:
logins_df.head()
Out[4]:
login_time
0 1970-01-01 20:13:18
1 1970-01-01 20:16:10
2 1970-01-01 20:16:37
3 1970-01-01 20:16:36
4 1970-01-01 20:26:21
In [5]:
logins_df.describe()
Out[5]:
login_time
count 93142
unique 92265
top 1970-02-12 11:16:53
freq 3
first 1970-01-01 20:12:16
last 1970-04-13 18:57:38

Aggregate data every 15 minutes

In [6]:
# Set time stamps as index
logins_df.index = logins_df.login_time
# Aggregate time every 15 minutes
login_agg = logins_df.resample('15min').count()
login_agg.columns = ['15_min_count']
print(login_agg.head())
# Aggregation sum must be equal to len of original data frame
login_agg['15_min_count'].sum() == len(logins_df)
                     15_min_count
login_time                       
1970-01-01 20:00:00             2
1970-01-01 20:15:00             6
1970-01-01 20:30:00             9
1970-01-01 20:45:00             7
1970-01-01 21:00:00             1
Out[6]:
True

Plot data

In [7]:
fig = login_agg.plot(figsize=(20, 10), fontsize=20)
fig.legend(["15 minutes count"])
_ = plt.xlabel("Login Time", size=20)
_ = plt.ylabel("15 Minutes Count", size=20)
_ = plt.show()

The data is too dense; it needs to be resampled to properly identify any patterns visually. It may be helpful to resample the data daily and plot the total logins per day.

In [8]:
daily = login_agg.resample("D").sum()
_ = daily.plot(figsize=(20, 10), fontsize=20)
_ = plt.xlabel("Login Time", size=20)
_ = plt.ylabel("Daily Count", size=20)
_ = plt.show()

The daily plot shows that activity tends to peak at the end of each week, drop sharply at the start of the following week, and then build up again. The one exception to this pattern is the third week of March 1970, where most of the daily counts remain relatively high.

Let's now see what happens within each day by resampling the data hourly and plotting a sample week.

In [9]:
hourly = login_agg.resample("H").sum()
_ = hourly[74:288].plot(figsize=(20, 10), fontsize=20)
_ = plt.xlabel("Login Time", size=20)
_ = plt.ylabel("Hourly Count", size=20)
_ = plt.show()

Hourly aggregation shows that demand peaks daily around noon and midnight, and that most of the demand occurs on weekends.

Part 3 ‐ Predictive modeling

1. Perform any cleaning, exploratory analysis, and/or visualizations to use the provided data for this analysis (a few sentences/plots describing your approach will suffice). What fraction of the observed users were retained?

In [10]:
# ultimate_df = pd.read_json('./ultimate_data_challenge.json')
df = pd.DataFrame(json.load((open('./ultimate_data_challenge.json'))))
df.head()
Out[10]:
avg_dist avg_rating_by_driver avg_rating_of_driver avg_surge city last_trip_date phone signup_date surge_pct trips_in_first_30_days ultimate_black_user weekday_pct
0 3.67 5.0 4.7 1.10 King's Landing 2014-06-17 iPhone 2014-01-25 15.4 4 True 46.2
1 8.26 5.0 5.0 1.00 Astapor 2014-05-05 Android 2014-01-29 0.0 0 False 50.0
2 0.77 5.0 4.3 1.00 Astapor 2014-01-07 iPhone 2014-01-06 0.0 3 False 100.0
3 2.36 4.9 4.6 1.14 King's Landing 2014-06-29 iPhone 2014-01-10 20.0 9 True 80.0
4 3.13 4.9 4.4 1.19 Winterfell 2014-03-15 Android 2014-01-27 11.8 14 False 82.4
In [11]:
df.describe()
Out[11]:
avg_dist avg_rating_by_driver avg_rating_of_driver avg_surge surge_pct trips_in_first_30_days weekday_pct
count 50000.000000 49799.000000 41878.000000 50000.000000 50000.000000 50000.000000 50000.000000
mean 5.796827 4.778158 4.601559 1.074764 8.849536 2.278200 60.926084
std 5.707357 0.446652 0.617338 0.222336 19.958811 3.792684 37.081503
min 0.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000
25% 2.420000 4.700000 4.300000 1.000000 0.000000 0.000000 33.300000
50% 3.880000 5.000000 4.900000 1.000000 0.000000 1.000000 66.700000
75% 6.940000 5.000000 5.000000 1.050000 8.600000 3.000000 100.000000
max 160.960000 5.000000 5.000000 8.000000 100.000000 125.000000 100.000000
In [12]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
avg_dist                  50000 non-null float64
avg_rating_by_driver      49799 non-null float64
avg_rating_of_driver      41878 non-null float64
avg_surge                 50000 non-null float64
city                      50000 non-null object
last_trip_date            50000 non-null object
phone                     49604 non-null object
signup_date               50000 non-null object
surge_pct                 50000 non-null float64
trips_in_first_30_days    50000 non-null int64
ultimate_black_user       50000 non-null bool
weekday_pct               50000 non-null float64
dtypes: bool(1), float64(6), int64(1), object(4)
memory usage: 4.2+ MB

'avg_rating_by_driver', 'avg_rating_of_driver' and 'phone' have null values. However, the rows containing those null fields may still contain useful information. The average rating values tend to be very high, so the null values in 'avg_rating_by_driver' and 'avg_rating_of_driver' are replaced by their mean values, 4.78 and 4.608 respectively. For the 'phone' column, the null values are replaced by the string 'other'. The column 'ultimate_black_user' is converted from boolean to int, and 'last_trip_date' and 'signup_date' are converted to datetime type.

In [13]:
# Copy df to start modifying data set
data = df.copy(deep=True)

# Fill null values
data.avg_rating_by_driver.fillna(data.avg_rating_by_driver.mean(), inplace=True)
data.avg_rating_of_driver.fillna(data.avg_rating_of_driver.mean(), inplace=True)
data.phone.fillna('other', inplace=True)

# Change boolean to int
data.ultimate_black_user = data.ultimate_black_user.astype(int)

# Turn time columns to datetime format
data.last_trip_date = pd.to_datetime(data.last_trip_date, infer_datetime_format=True)
data.signup_date = pd.to_datetime(data.signup_date, infer_datetime_format=True)

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
avg_dist                  50000 non-null float64
avg_rating_by_driver      50000 non-null float64
avg_rating_of_driver      50000 non-null float64
avg_surge                 50000 non-null float64
city                      50000 non-null object
last_trip_date            50000 non-null datetime64[ns]
phone                     50000 non-null object
signup_date               50000 non-null datetime64[ns]
surge_pct                 50000 non-null float64
trips_in_first_30_days    50000 non-null int64
ultimate_black_user       50000 non-null int64
weekday_pct               50000 non-null float64
dtypes: datetime64[ns](2), float64(6), int64(2), object(2)
memory usage: 4.6+ MB

Display correlation matrix bewteen numerical variables.

In [14]:
data.corr()
Out[14]:
avg_dist avg_rating_by_driver avg_rating_of_driver avg_surge surge_pct trips_in_first_30_days ultimate_black_user weekday_pct
avg_dist 1.000000 0.079793 0.028508 -0.081491 -0.104414 -0.136329 0.032310 0.101652
avg_rating_by_driver 0.079793 1.000000 0.101660 0.010498 0.019964 -0.039097 0.009328 0.020366
avg_rating_of_driver 0.028508 0.101660 1.000000 -0.021653 -0.003290 -0.011060 -0.001916 0.012587
avg_surge -0.081491 0.010498 -0.021653 1.000000 0.793582 -0.001841 -0.078791 -0.110071
surge_pct -0.104414 0.019964 -0.003290 0.793582 1.000000 0.005720 -0.106861 -0.144918
trips_in_first_30_days -0.136329 -0.039097 -0.011060 -0.001841 0.005720 1.000000 0.112210 0.050388
ultimate_black_user 0.032310 0.009328 -0.001916 -0.078791 -0.106861 0.112210 1.000000 0.035998
weekday_pct 0.101652 0.020366 0.012587 -0.110071 -0.144918 0.050388 0.035998 1.000000

An inspection of the pairwise correlations between numerical columns does not show strong correlations between variables. The strongest correlation is between 'avg_surge' and 'surge_pct', with a value of 0.79. This indicates that practically all the columns can be treated as independent variables in the subsequent analysis.

Pairwise correlation plot.

In [15]:
sns.pairplot(data, diag_kind="kde")
plt.show()

The diagonal shows the distribution of values for each variable. Most of the variables have values concentrated at one end of their range. This may cause the models to underperform during training, so it may be better to rescale the variables that are both skewed and cover a long range of values. This decision will be made after the model to train is chosen.

To calculate the fraction of retained users, given that the exact date when the data was retrieved is not provided, the most recent date in the column 'last_trip_date' is taken as the reference for estimating user retention. Retained users are those who completed a trip in the 30 days preceding the reference date.

In [16]:
ref_date = data.last_trip_date.max()
# Fraction of retained users
fraction = np.sum((ref_date - data.last_trip_date) <= '30 days') / len(data) * 100
print("Fraction of retained users: %0.2f%%"%(fraction))
Fraction of retained users: 37.61%

2. Build a predictive model to help Ultimate determine whether or not a user will be active in their 6th month on the system. Discuss why you chose your approach, what alternatives you considered, and any concerns you have. How valid is your model? Include any key indicators of model performance.

Using one-hot encoding for categorical columns

In [17]:
model_data = pd.get_dummies(data)
model_data.head()
Out[17]:
avg_dist avg_rating_by_driver avg_rating_of_driver avg_surge last_trip_date signup_date surge_pct trips_in_first_30_days ultimate_black_user weekday_pct city_Astapor city_King's Landing city_Winterfell phone_Android phone_iPhone phone_other
0 3.67 5.0 4.7 1.10 2014-06-17 2014-01-25 15.4 4 1 46.2 0 1 0 0 1 0
1 8.26 5.0 5.0 1.00 2014-05-05 2014-01-29 0.0 0 0 50.0 1 0 0 1 0 0
2 0.77 5.0 4.3 1.00 2014-01-07 2014-01-06 0.0 3 0 100.0 1 0 0 0 1 0
3 2.36 4.9 4.6 1.14 2014-06-29 2014-01-10 20.0 9 1 80.0 0 1 0 0 1 0
4 3.13 4.9 4.4 1.19 2014-03-15 2014-01-27 11.8 14 0 82.4 0 0 1 1 0 0

Defining targets and features set

In [18]:
# Features
X = model_data[['avg_dist', 'avg_rating_by_driver', 'avg_rating_of_driver', 'avg_surge', \
                'surge_pct', 'trips_in_first_30_days', 'ultimate_black_user', 'weekday_pct', \
                'city_Astapor', "city_King's Landing", 'city_Winterfell', \
                'phone_Android', 'phone_iPhone', 'phone_other']]
# Targets
y = ((ref_date - data.last_trip_date) <= '30 days').astype(int)

Develop model

The problem at hand is binary classification. For this kind of problem, SVMs, logistic regression and decision trees are all good options. However, we want to interpret the results in addition to having reliable predictions. This rules out SVMs: despite their potentially high accuracy, their results are harder to interpret. Between logistic regression and decision trees, I decided on a tree-based model because its results are easier to interpret. Decision trees, however, overfit easily, so among tree-based models I chose a random forest classifier, which mitigates overfitting and reduces variance.

In this problem we care about correctly identifying the users who are active in the 6th month after signup. A user who is active after 6 months belongs to the positive class (1), and an inactive user to the negative class (0). Given that the data set is imbalanced, with the positive class representing only about 37% of the data, precision is an appropriate metric to evaluate the model. Precision is the fraction of correctly identified positives among all targets predicted as positive.

In [19]:
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.ensemble import RandomForestClassifier
from time import time

# Definition of random seed
seed = 14
np.random.seed(14)
# Divide data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed, \
                                                    shuffle=True, stratify=y)
# Model
model = RandomForestClassifier(random_state=seed, n_jobs=-1)
# Parameters for grid search
parameters = {'n_estimators': [16, 20, 25], \
              'max_depth': [75, 80], \
              'max_features': ['sqrt'], \
              'min_samples_leaf': [18, 19], \
              'min_samples_split': [10, 13, 15]}
# Defining cross validation object
cv = KFold(n_splits=5, shuffle=True, random_state=seed)
# Define grid search
grid_search = GridSearchCV(model, param_grid=parameters, cv=cv, scoring='precision')
# Perform grid search for best parameters
start = time()
grid_search.fit(X_train, y_train)
end = time()

print("Grid-search time: %.3f" %(end-start))
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Grid-search time: 70.606
Best score: 0.749
Best parameters set:
	max_depth: 75
	max_features: 'sqrt'
	min_samples_leaf: 18
	min_samples_split: 10
	n_estimators: 25
In [20]:
from sklearn.metrics import precision_score
# Model
model = RandomForestClassifier(max_depth= 75, \
                               max_features= 'sqrt', \
                               min_samples_leaf= 18, \
                               min_samples_split= 10, \
                               n_estimators= 25, \
                               random_state=seed) 
# Fit model
model.fit(X_train, y_train)
# Predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Metrics
precision_train = precision_score(y_train, y_train_pred)
precision_test = precision_score(y_test, y_test_pred)

print("Precision on training data: \n\t{}".format(precision_train))
print("Precision on test data: \n\t{}".format(precision_test))
Precision on training data: 
	0.7763568309419838
Precision on test data: 
	0.7480680061823802
In [21]:
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import roc_curve, auc

confusion = confusion_matrix(y_test, y_test_pred)

accuracy = accuracy_score(y_test, y_test_pred)

precision, recall, f1_score, support = precision_recall_fscore_support(y_test, y_test_pred, beta=1)

# Predict probabilities, the second element of the predictions
# contains the probability of having a positive
y_test_prob = [prob[1] for prob in model.predict_proba(X_test)]

fpr, tpr, thresholds = roc_curve(y_test, y_test_prob)

auc_roc = auc(fpr, tpr)
# Plot ROC
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Random Forest')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

print("Area under ROC: {:.4f}".format(auc_roc))
print("Confusion matrix:\n {}".format(confusion))
print("Accuracy: {:.2f}%".format(accuracy*100))
print("Precision for 0 and 1: {:.2f}% and {:.2f}%".format(precision[0]*100, precision[1]*100))
print("Recall for 0 and 1: {:.2f}% and {:.2f}%".format(recall[0]*100, recall[1]*100))
print("F1 score for 0 and 1: {:.2f}% and {:.2f}%".format(f1_score[0]*100, f1_score[1]*100))
Area under ROC: 0.8495
Confusion matrix:
 [[5424  815]
 [1341 2420]]
Accuracy: 78.44%
Precision for 0 and 1: 80.18% and 74.81%
Recall for 0 and 1: 86.94% and 64.34%
F1 score for 0 and 1: 83.42% and 69.18%

3. Briefly discuss how Ultimate might leverage the insights gained from the model to improve its long-term rider retention (again, a few sentences will suffice).

Let's interpret the results by analyzing the feature importances and visualizing one of the trees.

In [27]:
feature_importance = model.feature_importances_ 
features = X.columns
print("Feature importance:")
for i in range(len(features)):
    print("\t X_{} -- {}:               {:.5f}".format(i, features[i], feature_importance[i]))
Feature importance:
	 X_0 -- avg_dist:               0.05801
	 X_1 -- avg_rating_by_driver:               0.20025
	 X_2 -- avg_rating_of_driver:               0.02510
	 X_3 -- avg_surge:               0.10120
	 X_4 -- surge_pct:               0.09224
	 X_5 -- trips_in_first_30_days:               0.08427
	 X_6 -- ultimate_black_user:               0.05711
	 X_7 -- weekday_pct:               0.10320
	 X_8 -- city_Astapor:               0.02803
	 X_9 -- city_King's Landing:               0.13443
	 X_10 -- city_Winterfell:               0.01860
	 X_11 -- phone_Android:               0.04733
	 X_12 -- phone_iPhone:               0.05007
	 X_13 -- phone_other:               0.00016
In [23]:
from io import StringIO
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(model.estimators_[24], out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.759357 to fit

Out[23]:

The most important factor in rider retention is 'avg_rating_by_driver'. So Ultimate needs to make sure that the user experience is up to expectations, perhaps even by providing drivers some training in customer interaction.

People from King's Landing are more likely to be retained than people from the other cities. Similarly, iPhone owners are more likely to be retained than Android users. iPhone users are associated with higher incomes, so people from King's Landing may also be wealthier than those from other cities. To increase retention among users with lower incomes, special discounts could be offered to Android users and to inhabitants of Winterfell and Astapor.

Given that the number of trips in the first 30 days plays an important role in user retention, special offers could be created to encourage service use during the first month after signup.

Finally, given that 'weekday_pct' is also an important factor in user retention, Ultimate could introduce incentives that encourage more trips per week, or even a system of points per mile traveled with the service.